Abstract

Vision problems ranging from image clustering to motion segmentation to semi-supervised learning
can naturally be framed as subspace segmentation problems, in which one aims to recover multiple
low-dimensional subspaces from noisy and corrupted input data. Low-Rank Representation (LRR),
a convex formulation of the subspace segmentation problem, is provably and empirically accurate
on small problems but does not scale to the massive sizes of modern vision datasets. Moreover,
past work aimed at scaling up low-rank matrix factorization is not applicable to LRR given its
non-decomposable constraints. In this work, we propose a novel divide-and-conquer algorithm for
large-scale subspace segmentation that can cope with LRR's non-decomposable constraints and maintains
LRR's strong recovery guarantees. This has immediate implications for the scalability of subspace
segmentation, which we demonstrate on a benchmark face recognition dataset and in simulations.
We then introduce novel applications of LRR-based subspace segmentation to large-scale
semi-supervised learning for multimedia event detection, concept detection, and image tagging. In each
case, we obtain state-of-the-art results and order-of-magnitude speed-ups.



1 Introduction

Visual data, though innately high dimensional, often reside on or lie close to a union of low-dimensional subspaces.
These subspaces might reflect physical constraints on the objects comprising images and video (e.g., faces under
varying illumination [?] or trajectories of rigid objects [24]) or naturally occurring variations in production (e.g.,
digits hand-written by different individuals [12]). Subspace segmentation techniques model these classes of data by
recovering bases for the multiple underlying subspaces [10, 6, ?]. Applications include image clustering [?], segmentation
of images, video, and motion [30, 5, 26], and affinity graph construction for semi-supervised learning [32].

One promising, convex formulation of the subspace segmentation problem is the low-rank representation (LRR)
program of Liu et al. [17, 18]:

$$(\hat{Z}, \hat{S}) = \operatorname{argmin}_{Z, S} \; \|Z\|_* + \lambda \|S\|_{2,1} \quad \text{subject to} \quad M = MZ + S. \qquad (1)$$

Here, $M$ is an input matrix of datapoints drawn from multiple subspaces, $\|\cdot\|_*$ is the nuclear norm, $\|\cdot\|_{2,1}$ is the sum
of the column $\ell_2$ norms, and $\lambda$ is a parameter that trades off between these penalties. LRR segments the columns of
$M$ into subspaces using the solution $\hat{Z}$, and, along with its extensions (e.g., LatLRR [19] and NNLRS [32]), admits
strong guarantees of correctness and strong empirical performance on clustering and graph construction applications.
However, the standard algorithms for solving Eq. (1) are unsuitable for large-scale problems, due to their sequential
nature and their reliance on the repeated computation of costly truncated SVDs.
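To make the two penalties in Eq. (1) concrete, the following is a minimal NumPy sketch that evaluates the LRR objective and checks the constraint for a candidate pair. The helper names (`nuclear_norm`, `l21_norm`, `lrr_objective`, `is_feasible`) and the toy data are illustrative assumptions, not part of the original work.

```python
import numpy as np

def nuclear_norm(Z):
    """Sum of the singular values of Z."""
    return float(np.linalg.svd(Z, compute_uv=False).sum())

def l21_norm(S):
    """Sum of the Euclidean norms of the columns of S."""
    return float(np.linalg.norm(S, axis=0).sum())

def lrr_objective(Z, S, lam):
    """Objective of Eq. (1): ||Z||_* + lam * ||S||_{2,1}."""
    return nuclear_norm(Z) + lam * l21_norm(S)

def is_feasible(M, Z, S, tol=1e-8):
    """Check the constraint M = M Z + S up to a relative tolerance."""
    return np.linalg.norm(M - M @ Z - S) <= tol * max(1.0, np.linalg.norm(M))

# Toy data (hypothetical): two pairs that are feasible for any M.
rng = np.random.default_rng(0)
M = rng.standard_normal((5, 8))
pair_identity = (np.eye(8), np.zeros((5, 8)))     # Z = I, S = 0
pair_all_outliers = (np.zeros((8, 8)), M.copy())  # Z = 0, S = M
```

Both toy pairs satisfy the constraint; the program in Eq. (1) searches among all such feasible pairs for the one that best trades off low rank in Z against column sparsity in S.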

Much of the computational burden in solving LRR stems from the nuclear norm penalty, which is known to encourage
low-rank solutions, so one might hope to leverage the large body of past work on parallel and distributed matrix
factorization [?, 23, ?, 31, ?] to improve the scalability of LRR. Unfortunately, these techniques are tailored to
optimization problems with losses and constraints that decouple across the entries of the input matrix. This decoupling
requirement is violated in the LRR problem due to the $M = MZ + S$ constraint of Eq. (1).

Instead, we develop, analyze, and evaluate a novel divide-and-conquer approach to scaling up LRR that specifically 
accounts for the non-decomposable structure of the LRR problem. Specifically, our contributions are three-fold: 



Algorithm: We introduce a parallel, divide-and-conquer approximation algorithm for LRR that is suitable for large- 
scale subspace segmentation problems. Scalability is achieved by dividing the original LRR problem into computa- 
tionally tractable and communication-free subproblems, solving the subproblems in parallel, and combining the results 
using a technique from randomized matrix approximation. Our algorithm, which we call DFC-LRR, is based on the 
principles of the Divide-Factor-Combine (DFC) framework [21] for decomposable matrix factorization but, unlike
DFC, can cope with the non-decomposable constraints of LRR. 

Analysis: We characterize the segmentation behavior of our new algorithm, showing that DFC-LRR maintains seg- 
mentation guarantees of the original LRR algorithm with high probability, even while enjoying substantial speed-ups 
over its namesake. Our new analysis features a significant broadening of the original LRR theory to treat the richer 
class of LRR-type subproblems that arise in DFC-LRR. 

Applications: We first present results on face clustering and synthetic subspace segmentation which show that DFC- 
LRR demonstrates accuracy comparable to LRR in a fraction of the time. We then propose and validate a novel 
application of the LRR methodology to large-scale graph-based semi-supervised learning. While LRR has been used 
to construct affinity graphs for semi-supervised learning in the past [32], prior attempts have failed to scale to the
sizes of real-world datasets. Leveraging the favorable computational properties of DFC-LRR, we propose a scalable 
strategy for constructing such subspace affinity graphs. We apply our methodology to a variety of computer vision 
tasks - multimedia event detection, concept detection, and image tagging - demonstrating an order of magnitude 
improvement in speed and accuracy that exceeds the state of the art. 

The remainder of the paper is organized as follows. In Section 2 we first review the low-rank representation approach
to subspace segmentation and then introduce our novel DFC-LRR algorithm. Next, we present our theoretical analysis
of DFC-LRR in Section 3. Section 4 highlights the accuracy and efficiency of DFC-LRR on a variety of computer
vision tasks. We present subspace segmentation results on simulated and real-world data in Section 4.1. In Section 4.2
we present our novel application of DFC-LRR to graph-based semi-supervised learning problems, and we conclude
in Section 5.

Notation: Given a matrix $M \in \mathbb{R}^{m \times n}$, we define $U_M \Sigma_M V_M^\top$ as the compact singular value decomposition (SVD)
of $M$, where $\operatorname{rank}(M) = r$, $\Sigma_M$ is a diagonal matrix of the $r$ non-zero singular values, and $U_M \in \mathbb{R}^{m \times r}$ and
$V_M \in \mathbb{R}^{n \times r}$ are the associated left and right singular vectors of $M$. We denote the orthogonal projection onto the
column space of $M$ as $P_M$.

2 Divide-and-Conquer Segmentation

In this section, we review the LRR approach to subspace segmentation and present our novel algorithm, DFC-LRR. 

2.1 Subspace Segmentation via LRR 

In the robust subspace segmentation problem, we observe a matrix $M = L_0 + S_0 \in \mathbb{R}^{m \times n}$, where the columns of $L_0$
are datapoints drawn from multiple independent subspaces,¹ and $S_0$ is a column-sparse outlier matrix. Our goal is to
identify the subspace associated with each column of $L_0$, despite the potentially gross corruption introduced by $S_0$. An
important observation for this task is that the projection matrix $V_{L_0} V_{L_0}^\top$ for the row space of $L_0$, sometimes termed
the shape interaction matrix, is block diagonal whenever the columns of $L_0$ lie in multiple independent subspaces [10].
Hence, we can achieve accurate segmentation by first recovering the row space of $L_0$.
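The block-diagonal structure of the row-space projection can be checked numerically. The sketch below, with hypothetical sizes and random data, draws columns from two independent subspaces, forms the projection onto the row space of the clean matrix, and inspects the block coupling the two groups of columns.

```python
import numpy as np

rng = np.random.default_rng(0)
m, r, n_s = 30, 2, 6  # ambient dimension, subspace dimension, samples per subspace

# Two random r-dimensional subspaces of R^m (independent with probability one here).
L1 = rng.standard_normal((m, r)) @ rng.standard_normal((r, n_s))
L2 = rng.standard_normal((m, r)) @ rng.standard_normal((r, n_s))
L0 = np.hstack([L1, L2])

# Compact SVD of L0; keep only the numerically non-zero singular values.
_, s, Vt = np.linalg.svd(L0, full_matrices=False)
rank = int((s > 1e-10 * s[0]).sum())
V = Vt[:rank].T

# Orthogonal projection onto the row space of L0.
row_space_projection = V @ V.T

# Entries coupling columns of the first subspace with columns of the second.
off_block = row_space_projection[:n_s, n_s:]
```

Up to floating-point error, `off_block` vanishes, so thresholding the magnitude of the projection's entries groups the columns by subspace in this noiseless toy setting.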



The LRR approach of [17] seeks to recover the row space of $L_0$ by solving the convex optimization problem presented
in Eq. (1). Importantly, the LRR solution comes with a guarantee of correctness: the column space of $\hat{Z}$ is exactly
equal to the row space of $L_0$ whenever certain technical conditions are met [18] (see Sec. 3 for more details).

Moreover, as we will show in this work, LRR is also well-suited for the construction of affinity graphs for semi- 
supervised learning. In this setting, the goal is to define an affinity graph in which nodes correspond to data points and 
edge weights exist between nodes drawn from the same subspace. LRR can thus be used to recover the block-sparse 
structure of the graph's affinity matrix, and these affinities can be used for semi-supervised label propagation. 

2.2 Divide-Factor-Combine LRR (DFC-LRR) 

We now present our scalable divide-and-conquer algorithm, called DFC-LRR, for LRR-based subspace segmentation.
DFC-LRR leverages the principles of the DFC framework of [21], but extends them to a new non-decomposable
problem. The DFC-LRR algorithm is summarized in Algorithm 1, and we next describe each step in further detail.

¹ Subspaces are independent if the dimension of their direct sum is the sum of their dimensions.

D step - Divide input matrix into submatrices: DFC-LRR randomly partitions the columns of $M$ into $t$ $l$-column
submatrices, $\{C_1, \ldots, C_t\}$. For simplicity, we assume that $t$ divides $n$ evenly.
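The D step amounts to a uniformly random partition of the column indices. A minimal sketch follows; the function name `sample_cols` mirrors the SampleCols routine of Algorithm 1, but its signature and return values are assumptions made for illustration.

```python
import numpy as np

def sample_cols(M, t, rng):
    """Randomly partition the columns of M into t submatrices of equal width.

    Returns the list of submatrices and the column indices of each block,
    so that results can later be mapped back to the original ordering.
    """
    n = M.shape[1]
    if n % t != 0:
        raise ValueError("this sketch assumes that t divides n evenly")
    perm = rng.permutation(n)          # random ordering of the column indices
    index_blocks = np.split(perm, t)   # t disjoint blocks of l = n / t indices
    return [M[:, idx] for idx in index_blocks], index_blocks

rng = np.random.default_rng(0)
M = rng.standard_normal((4, 12))
blocks, index_blocks = sample_cols(M, t=3, rng=rng)
```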

F step - Factor submatrices in parallel: DFC-LRR solves $t$ subproblems in parallel. The $i$th LRR subproblem
is of the form

$$\min_{Z_i, S_i} \; \|Z_i\|_* + \lambda \|S_i\|_{2,1} \quad \text{subject to} \quad C_i = M Z_i + S_i, \qquad (2)$$

where the input matrix $M$ is used as a dictionary but only a subset of columns is used as observations.² A typical LRR
algorithm can be easily modified to solve the problem in Eq. (2) and will return a low-rank estimate $\hat{Z}_i$ in factored form.
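As an illustration of how a standard solver adapts to Eq. (2), the sketch below applies textbook inexact augmented Lagrangian updates with an auxiliary variable J = Z. It is an assumption-laden sketch rather than the implementation used in the experiments: the function name `lrr_subproblem`, the penalty schedule (`mu`, `rho`, `mu_max`), and the fixed iteration count are all hypothetical choices, and the estimate is returned densely rather than in factored form.

```python
import numpy as np

def svt(A, tau):
    """Singular value thresholding: proximal operator of tau * nuclear norm."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return (U * np.maximum(s - tau, 0.0)) @ Vt

def column_shrink(G, tau):
    """Column-wise shrinkage: proximal operator of tau * l_{2,1} norm."""
    norms = np.linalg.norm(G, axis=0)
    scale = np.maximum(1.0 - tau / np.maximum(norms, 1e-300), 0.0)
    return G * scale

def lrr_subproblem(C, M, lam, mu=1e-2, rho=1.5, mu_max=1e8, n_iter=200):
    """Approximately solve min ||Z||_* + lam ||S||_{2,1} s.t. C = M Z + S."""
    n, l = M.shape[1], C.shape[1]
    Z = np.zeros((n, l))
    S = np.zeros_like(C)
    Y1 = np.zeros_like(C)      # multiplier for C = M Z + S
    Y2 = np.zeros((n, l))      # multiplier for Z = J
    gram_inv = np.linalg.inv(np.eye(n) + M.T @ M)
    for _ in range(n_iter):
        J = svt(Z + Y2 / mu, 1.0 / mu)
        Z = gram_inv @ (M.T @ (C - S) + J + (M.T @ Y1 - Y2) / mu)
        S = column_shrink(C - M @ Z + Y1 / mu, lam / mu)
        Y1 = Y1 + mu * (C - M @ Z - S)
        Y2 = Y2 + mu * (Z - J)
        mu = min(rho * mu, mu_max)  # gradually tighten the penalty
    return Z, S

# Toy usage (hypothetical data): a rank-3 dictionary and its first 10 columns.
rng = np.random.default_rng(0)
M = rng.standard_normal((12, 3)) @ rng.standard_normal((3, 30))
C = M[:, :10]
Z_hat, S_hat = lrr_subproblem(C, M, lam=0.5)
residual = np.linalg.norm(C - M @ Z_hat - S_hat)
```

The only change relative to the full problem in Eq. (1) is that the observations C are a column submatrix of the dictionary M, so the estimate has shape n x l instead of n x n.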

C step - Combine submatrix estimates: DFC-LRR generates a final approximation $\hat{Z}^{proj}$ for the low-rank
LRR solution $\hat{Z}$ by projecting $[\hat{Z}_1, \ldots, \hat{Z}_t]$ onto the column space of $\hat{Z}_1$. This standard column projection technique
from randomized matrix approximation [?] was also employed by the DFC-Proj algorithm of [21].
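The column projection of the C step can be sketched in a few lines. The function name `col_proj` and the rank tolerance are assumptions; for simplicity the sketch works with dense blocks, whereas a factored representation would be needed to realize the stated runtime.

```python
import numpy as np

def col_proj(blocks, ref, tol=1e-10):
    """Project the concatenation [blocks[0], ..., blocks[-1]] onto col(ref)."""
    U, s, _ = np.linalg.svd(ref, full_matrices=False)
    # Keep left singular vectors for numerically non-zero singular values.
    k = int((s > tol * s[0]).sum()) if s.size and s[0] > 0 else 0
    Q = U[:, :k]                  # orthonormal basis of the column space of ref
    Z_all = np.hstack(blocks)
    return Q @ (Q.T @ Z_all)

# Toy check (hypothetical data): when the first block already spans the column
# space of a rank-2 matrix, the projection reproduces the matrix.
rng = np.random.default_rng(0)
Z_full = rng.standard_normal((12, 2)) @ rng.standard_normal((2, 9))
blocks = np.split(Z_full, 3, axis=1)
Z_proj = col_proj(blocks, blocks[0])
```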

Runtime: As noted in [21], many state-of-the-art solvers for nuclear-norm regularized problems like Eq. (1)
have $\Omega(m n k_M)$ per-iteration time complexity due to the rank-$k_M$ truncated SVD required on each iteration.
DFC-LRR reduces this per-iteration complexity significantly and requires just $O(m l k_{C_i})$ time for the $i$th subproblem.
Performing the subsequent column projection step is relatively cheap computationally, since an LRR solver can return
its solution in factored form. Indeed, if we define $k' \triangleq \max_i k_{C_i}$, then the column projection step of DFC-LRR
requires only $O(m k'^2 + l k'^2)$ time.



Algorithm 1 DFC-LRR

Input: $M$, $t$
$\{C_i\}_{1 \le i \le t}$ = SampleCols($M$, $t$)
do in parallel
    $\hat{Z}_1$ = LRR($C_1$, $M$)
    ⋮
    $\hat{Z}_t$ = LRR($C_t$, $M$)
end do
$\hat{Z}^{proj}$ = ColProj($[\hat{Z}_1, \ldots, \hat{Z}_t]$, $\hat{Z}_1$)



3 Theoretical Analysis 

Despite the significant reduction in computational complexity, DFC-LRR provably maintains the strong theoretical 
guarantees of the LRR algorithm. To make this statement precise, we first review the technical conditions for accurate 
row space recovery required by LRR. 

3.1 Conditions for LRR Correctness 

The LRR analysis of Liu et al. [18] relies on two key quantities, the rank of the clean data matrix $L_0$ and the
coherence [22] of the singular vectors $V_{L_0}$. We combine these properties into a single definition:

Definition 1 ($(\mu, r)$-Coherence). A matrix $L \in \mathbb{R}^{m \times n}$ is $(\mu, r)$-coherent if $\operatorname{rank}(L) = r$ and

$$\frac{n}{r} \|V_L^\top\|_{2,\infty}^2 \le \mu,$$

where $\|\cdot\|_{2,\infty}$ is the maximum column $\ell_2$ norm.³

² An alternative formulation of DFC-LRR involves replacing both instances of $M$ with $C_i$ in the formulation of Eq. (1). The
resulting low-rank estimate $\hat{Z}_i$ would have dimensions $l \times l$, and the C step of DFC would compute a low-rank approximation on
the block-diagonal matrix $\operatorname{diag}(\hat{Z}_1, \hat{Z}_2, \ldots, \hat{Z}_t)$.

Figure 1: Results on synthetic data (reported results are averages over 10 trials). (a) Phase transition of LRR and
DFC-LRR. (b,c) Timing results of LRR and DFC-LRR as functions of $\gamma$ and $n$ respectively.



Intuitively, when the coherence $\mu$ is small, information is well-distributed across the rows of a matrix, and the row
space is easier to recover from outlier corruption. Using these properties, Liu et al. [18] established the following
recovery guarantee for LRR.
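For intuition, the smallest $\mu$ for which a matrix is $(\mu, r)$-coherent can be computed directly from its SVD. The sketch below assumes the reading of Definition 1 in which the quantity of interest is $(n/r)$ times the largest squared column norm of $V_L^\top$; the function name `coherence` and the rank tolerance are illustrative.

```python
import numpy as np

def coherence(L, tol=1e-10):
    """Return (mu, r): the rank r of L and the smallest mu such that L is (mu, r)-coherent."""
    _, s, Vt = np.linalg.svd(L, full_matrices=False)
    r = int((s > tol * s[0]).sum())
    n = L.shape[1]
    col_norms_sq = (Vt[:r] ** 2).sum(axis=0)  # squared l2 norms of the columns of V_L^T
    return (n / r) * float(col_norms_sq.max()), r

# Two rank-1 extremes with n = 4 columns (toy data).
spread = np.ones((3, 4))                      # every column carries equal weight
spiky = np.zeros((3, 4)); spiky[:, 0] = 1.0   # all information in a single column
mu_spread, _ = coherence(spread)
mu_spiky, _ = coherence(spiky)
```

In this toy comparison the evenly spread matrix attains the minimum value 1, while the matrix concentrated on one column attains the maximum value n = 4.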

Theorem 2 ([18]). Suppose that $M = L_0 + S_0 \in \mathbb{R}^{m \times n}$ where $S_0$ is supported on $\gamma n$ columns, $L_0$ is $(\frac{\mu}{1 - \gamma}, r)$-coherent,
and $L_0$ and $S_0$ have independent column support with $\operatorname{range}(L_0) \cap \operatorname{range}(S_0) = \{0\}$. Let $\hat{Z}$ be a solution
returned by LRR. Then there exists a constant $\gamma^*$ (depending on $\mu$ and $r$) for which the column space of $\hat{Z}$ exactly
equals the row space of $L_0$ whenever $\lambda = 3 / (7 \|M\| \sqrt{\gamma^* n})$ and $\gamma \le \gamma^*$.

In other words, LRR can exactly recover the row space of $L_0$ even when a constant fraction $\gamma^*$ of the columns has
been corrupted by outliers. As the rank $r$ and coherence $\mu$ shrink, $\gamma^*$ grows, allowing greater outlier tolerance.



3.2 High Probability Subspace Segmentation 

Our main theoretical result shows that, with high probability and under the same conditions that guarantee the accuracy
of LRR, DFC-LRR also exactly recovers the row space of $L_0$. Recall that in our independent subspace setting accurate
row space recovery is tantamount to correct segmentation of the columns of $L_0$. The proof of our result, which
generalizes the LRR analysis of [18] to a broader class of optimization problems and adapts the DFC analysis of [21],
can be found in the appendix.

Theorem 3. Fix any failure probability $\delta > 0$. Under the conditions of Thm. 2, let $\hat{Z}^{proj}$ be a solution returned by
DFC-LRR. Then there exists a constant $\gamma^*$ (depending on $\mu$ and $r$) for which the column space of $\hat{Z}^{proj}$ exactly equals
the row space of $L_0$ whenever $\lambda = 3 / (7 \|M\| \sqrt{\gamma^* l})$ for each DFC-LRR subproblem, $\gamma < \gamma^*$, and $t = n / l$ for

$$l \ge c r \mu \log(4n / \delta) / (\gamma^* - \gamma)^2$$

and $c$ a fixed constant larger than 1.

Thm. 3 establishes that, like LRR, DFC-LRR can tolerate a constant fraction of its data points being corrupted and
still recover the correct subspace segmentation of the clean data points with high probability. Moreover, when the
number of datapoints $n$ is large, solving LRR directly may be prohibitive, but DFC-LRR need only solve a collection
of small, tractable subproblems. Indeed, Thm. 3 guarantees high probability recovery for DFC-LRR even when the
subproblem size $l$ is logarithmic in $n$. The corresponding reduction in computational complexity allows DFC-LRR to
scale to large problem instances with little sacrifice in accuracy.
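The logarithmic dependence of the subproblem size on n can be illustrated by evaluating the bound of Thm. 3 directly. The constant c is not specified beyond being larger than 1, so the value used below, like the other numerical inputs, is purely hypothetical.

```python
import math

def min_subproblem_size(n, r, mu, gamma, gamma_star, delta, c):
    """Smallest integer l with l >= c * r * mu * log(4n / delta) / (gamma_star - gamma)^2."""
    if not 0.0 <= gamma < gamma_star:
        raise ValueError("the bound requires 0 <= gamma < gamma_star")
    return math.ceil(c * r * mu * math.log(4 * n / delta) / (gamma_star - gamma) ** 2)

# Hypothetical inputs: only n changes between the two calls.
common = dict(r=5, mu=1.0, gamma=0.1, gamma_star=0.5, delta=0.1, c=2.0)
l_small = min_subproblem_size(n=10**4, **common)
l_large = min_subproblem_size(n=10**8, **common)
```

Increasing n by four orders of magnitude less than doubles the required subproblem size under these assumed inputs, which is the sense in which l grows only logarithmically in n.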



4 Experiments 

We now explore the empirical performance of DFC-LRR on a variety of simulated and real-world datasets, first for
subspace segmentation and next for graph-based semi-supervised learning. For all of our experiments we use the
inexact Augmented Lagrange Multiplier (ALM) algorithm of [17] as our base LRR algorithm. For the subspace
segmentation experiments, we set the regularization parameter to the values suggested in previous works [18, 17],
while in our semi-supervised learning experiments we set it to $1 / \sqrt{\max(m, n)}$ as suggested in prior work. In all
experiments we report parallel running times for DFC-LRR, i.e., the time of the longest running subproblem plus the
time required to combine submatrix estimates via column projection. All experiments were implemented in Matlab.
The simulation studies were run on an x86-64 architecture using a single 2.60 GHz core and 30 GB of main memory,
while the real data experiments were performed on an x86-64 architecture equipped with a 2.67 GHz 12-core CPU and
64 GB of main memory.

³ Although [18] uses the notion of column coherence to analyze LRR, we work with the closely related notion of $(\mu, r)$-coherence
for ease of notation in our proofs. Moreover, we note that if a rank-$r$ matrix $L \in \mathbb{R}^{m \times n}$ is supported on $(1 - \gamma) n$ columns then the
column coherence of $V_L$ is $\mu$ if and only if $V_L$ is $(\mu / (1 - \gamma), r)$-coherent.

Figure 2: Exemplar face images from Extended Yale Database B. Each row shows randomly selected images for a
human subject.



4.1 Subspace Segmentation: LRR vs. DFC-LRR 

We first present experiments on subspace segmentation. In both our synthetic and real-world face recognition experi- 
ments, we will see that DFC-LRR grants accuracy comparable to LRR in a fraction of the time. 



4.1.1 Simulations 

We present results on simulated data to illustrate the accuracy and scalability of DFC-LRR. Using an experimental
setup similar to that described in [18], we construct $k$ independent $r$-dimensional subspaces of $\mathbb{R}^m$. The basis for
the first subspace $U_1 \in \mathbb{R}^{m \times r}$ is a uniformly random matrix with orthonormal columns, and for $1 < i \le k$, $U_i = R U_{i-1}$
where $R \in \mathbb{R}^{m \times m}$ is a uniformly random orthogonal matrix. We generate $n_s$ samples for each subspace with
$X_i \in \mathbb{R}^{m \times n_s}$ denoting the samples from the $i$th subspace, where $X_i = U_i T$ and $T \in \mathbb{R}^{r \times n_s}$ is a random matrix
with entries uniformly distributed in $[0, 1]$. We define $X_0 \in \mathbb{R}^{m \times k n_s} = [X_1 \ldots X_k]$. For a given outlier fraction
$\gamma$ we next generate an additional $n_o = \frac{\gamma}{1 - \gamma} k n_s$ outlier samples, denoted by $S \in \mathbb{R}^{m \times n_o}$. Each outlier sample has
independent $\mathcal{N}(0, \sigma^2)$ entries, where $\sigma$ is the average absolute value of the entries of the $k n_s$ original samples. Finally,
we create the input matrix $M \in \mathbb{R}^{m \times n}$, where $n = k n_s + n_o$, as a random permutation of the columns of $[X_0 \; S]$.
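The data-generation procedure above can be sketched as follows. The function name `make_synthetic` is hypothetical, a fresh coefficient matrix is drawn for each subspace, and random orthonormal and orthogonal matrices are obtained from the QR factorization of Gaussian matrices, a common stand-in that is an assumption here rather than a detail taken from the text.

```python
import numpy as np

def make_synthetic(m, k, r, n_s, gamma, rng):
    """Generate M (clean samples plus outliers, columns shuffled) and labels (-1 = outlier)."""
    U, _ = np.linalg.qr(rng.standard_normal((m, r)))  # basis of the first subspace
    R, _ = np.linalg.qr(rng.standard_normal((m, m)))  # random orthogonal matrix
    blocks, labels = [], []
    for i in range(k):
        if i > 0:
            U = R @ U                                  # U_i = R U_{i-1}
        blocks.append(U @ rng.uniform(0.0, 1.0, size=(r, n_s)))
        labels += [i] * n_s
    X0 = np.hstack(blocks)
    n_o = int(round(gamma / (1.0 - gamma) * k * n_s))  # outliers form a gamma fraction of n
    sigma = np.abs(X0).mean()                          # average absolute clean entry
    S = sigma * rng.standard_normal((m, n_o))
    labels += [-1] * n_o
    perm = rng.permutation(k * n_s + n_o)
    return np.hstack([X0, S])[:, perm], np.asarray(labels)[perm]

rng = np.random.default_rng(0)
M, labels = make_synthetic(m=50, k=3, r=2, n_s=20, gamma=0.25, rng=rng)
```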



In our first experiments we fix $k = 3$, $m = 1500$, $r = 5$, and $n_s = 200$, set the regularizer to $\lambda = 0.2$, and vary the
fraction of outliers. We measure with what frequency LRR and DFC-LRR are able to recover the row space of $X_0$
and identify the outlier columns in $S$, using the same criterion as defined in [18].⁴ Figure 1(a) presents our results,
showing average performance over 10 trials. We see that DFC-LRR performs quite well, as the gaps in the phase
transitions between LRR and DFC-LRR are small when sampling 10% of the columns (i.e., $t = 10$) and are virtually
non-existent when sampling 25% of the columns (i.e., $t = 4$).

Next, we present scalability results. Figure 1(b) shows corresponding timing results for the accuracy results presented
in Figure 1(a). These timing results show substantial speedups in DFC-LRR relative to LRR with a modest tradeoff in
accuracy as denoted in Figure 1(a). Note that we only report timing results for values of $\gamma$ for which DFC-LRR was
successful in all 10 trials, i.e., for which success rate equaled 1.0 in Figure 1(a). Moreover, Figure 1(c) shows timing
results using the same parameter values, except with a fixed fraction of outliers ($\gamma = 0.1$) and a variable number of
samples in each subspace, i.e., $n_s$ ranges from 75 to 1000. These timing results also show speedups with minimal loss
of accuracy, as in all of these timing experiments, LRR and DFC-LRR were successful in all trials using the same
criterion defined in [18] and used in our phase transition experiments of Figure 1(a).

⁴ Success is determined by whether the oracle constraints of Eq. (8) in the Appendix are satisfied within a tolerance of 10^…

Figure 3: Trade-off between computation and segmentation accuracy on face recognition experiments. All results are
obtained by averaging across 100 independent runs. (a) Run time of LRR and DFC-LRR with varying number of
subproblems. (b) Segmentation accuracy for these same experiments.



4.1.2 Face Clustering 

We next demonstrate the comparable quality and increased performance of DFC-LRR relative to LRR on real data,
namely, a subset of Extended Yale Database B, a standard face benchmarking dataset. Following the experimental
setup in [17], 640 frontal face images of 10 human subjects are chosen, each of which is resized to be 48 x 42 pixels
and forms a 2016-dimensional feature vector. As noted in previous work [?], a low-dimensional subspace can be
effectively used to model face images from one person, and hence face clustering is a natural application of subspace
segmentation. Moreover, as illustrated in Figure 2, a significant portion of the faces in this dataset are "corrupted" by
shadows, and hence this collection of images is an ideal benchmark for robust subspace segmentation.



As in [17], we use the feature vector representation of these images to create a 2016 x 640 dictionary matrix, $M$,
and run both LRR and DFC-LRR with the parameter $\lambda$ set to 0.15. Next, we use the resulting low-rank coefficient
matrix $\hat{Z}$ to compute an affinity matrix $U_{\hat{Z}} U_{\hat{Z}}^\top$, where $U_{\hat{Z}}$ contains the top left singular vectors of $\hat{Z}$. The affinity
matrix is used to cluster the data into $k = 10$ clusters (corresponding to the 10 human subjects) via spectral embedding
(to obtain a 10D feature representation) followed by $k$-means. Following [17], we evaluate the clustering performance
of both LRR and DFC-LRR by segmentation accuracy: each of the 10 clusters is assigned a label based on majority
vote of the ground truth labels of the points assigned to the cluster, and the segmentation accuracy is then computed by
averaging the percentage of correctly classified data over all classes.
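The majority-vote evaluation can be sketched as below. This is one plausible reading of the description (label each cluster by majority vote, then average per-class accuracy); the function name `segmentation_accuracy` and the toy labels are illustrative assumptions.

```python
import numpy as np

def segmentation_accuracy(true_labels, cluster_ids):
    """Label each cluster by majority vote, then average the per-class accuracy."""
    true_labels = np.asarray(true_labels)
    cluster_ids = np.asarray(cluster_ids)
    predicted = np.empty_like(true_labels)
    for c in np.unique(cluster_ids):
        members = cluster_ids == c
        values, counts = np.unique(true_labels[members], return_counts=True)
        predicted[members] = values[np.argmax(counts)]  # majority ground-truth label
    per_class = [np.mean(predicted[true_labels == k] == k) for k in np.unique(true_labels)]
    return float(np.mean(per_class))

# Toy example: one point of class 1 is placed in the cluster dominated by class 0.
truth = [0, 0, 0, 1, 1, 1]
clusters = [5, 5, 5, 7, 7, 5]
acc = segmentation_accuracy(truth, clusters)
```

In the toy example class 0 is fully recovered and class 1 is recovered for two of its three points, so the averaged accuracy is 5/6.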

Figures 3(a) and 3(b) show the computation time and the segmentation accuracy, respectively, for LRR and for DFC-LRR
with varying numbers of subproblems (i.e., values of $t$). On this relatively small data set ($n = 640$ faces), LRR
requires over 10 minutes to converge. Meanwhile, DFC-LRR demonstrates a roughly linear computational speedup
as a function of $t$, comparable accuracies to LRR for smaller values of $t$, and a quite gradual decrease in accuracy for
larger $t$.



4.2 Graph-based Semi-Supervised Learning 

Graph representations, in which samples are vertices and weighted edges express affinity relationships between sam- 
ples, are crucial in various computer vision tasks. Classical graph construction methods separately calculate the 
outgoing edges for each sample. This local strategy makes the graph vulnerable to contaminated data or outliers. 
Recent work in computer vision has illustrated the utility of global graph construction strategies using graph Laplacian [2]
or matrix low-rank [32] based regularizers. $\ell_1$ regularization has also been effectively used to encourage
sparse graph construction [13]. Building upon the success of global construction methods and noting the connection
between subspace segmentation and graph construction as described in Section 2.1, we present a novel application
of the low-rank representation methodology, relying on our DFC-LRR algorithm to scalably yield a sparse, low-rank
graph (SLR-graph). We present a variety of results on large-scale semi-supervised learning visual classification tasks
and provide a detailed comparison with leading baseline algorithms.

4.2.1 Benchmarking Data 

We adopt the following three large-scale benchmarks: 

Columbia Consumer Video (CCV) Content Detection: Compiled to stimulate research on recognizing highly diverse
visual content in unconstrained videos, this dataset consists of 9317 YouTube videos over 20 semantic categories
(e.g., baseball, beach, music performance). Three popular audio/visual features (5000-D SIFT, 5000-D STIP,
and 4000-D MFCC) are extracted.

MED12 Multimedia Event Detection: The MED12 video corpus consists of ~150K multimedia videos, with an
average duration of 2 minutes, and is used for detecting 20 specific semantic events. For each event, 130 to 367
videos are provided as positive examples, and the remainder of the videos are "null" videos that do not correspond to
any event. In this work, we keep all positive examples and sample 10K null videos, resulting in a dataset of 13,876
videos. We extract six features from each video, first at sampled frames and then accumulated to obtain video-level
representations. The features are either visual (1000-D sparse-SIFT, 1000-D dense-SIFT, 1500-D color-SIFT, 5000-D
STIP), audio (2000-D MFCC), or semantic features (2659-D CLASSEME [?]).

NUS-WIDE-Lite Image Tagging: NUS-WIDE is among the largest available image tagging benchmarks, consisting 
of over 269K crawled images from Flickr that are associated with over 5K user-provided tags. Ground-truth images are 
manually provided for 81 selected concept tags. We generate a lite version by sampling 20K images. For each image, 
128-D wavelet texture, 225-D block-wise LAB-based color moments and 500-D bag of visual words are extracted, 
normalized and finally concatenated to form a single feature representation for the image. 





Figure 4: Trade-off between computation and accuracy for the SLR-graph on the CCV dataset. (a) Wall time of LRR 
and DFC-LRR with varying numbers of subproblems. (b) mAP scores for these same experiments. 



4.2.2 Graph Construction Algorithms 

The three graph construction schemes we evaluate are described below. Note that we exclude other baselines (e.g.,
NNLRS [32], LLE graph [28], L1-graph [?]) due to either scalability concerns or because prior work has already
demonstrated inferior performance relative to the SPG algorithm defined below [32].

kNN-graph: We construct a nearest neighbor graph by connecting (via undirected edges) each sample vertex to its
$k$ nearest neighbors in terms of $\ell_2$ distance in the specified feature space. Exponential edge weights are associated with
these edges, i.e., $w_{ij} = \exp(-d_{ij}^2 / \sigma^2)$, where $d_{ij}$ is the distance between $x_i$ and $x_j$ and $\sigma$ is an empirically-tuned
parameter [27].
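A dense, small-scale sketch of the kNN-graph construction follows. The function name `knn_graph`, the symmetrization rule (keep an edge if either endpoint selects the other), and the toy data are assumptions; a large-scale implementation would use sparse storage and an approximate neighbor search.

```python
import numpy as np

def knn_graph(X, k, sigma):
    """Symmetric kNN affinity matrix with weights exp(-d_ij^2 / sigma^2).

    X has one sample per row. An undirected edge is kept if either endpoint
    lists the other among its k nearest neighbors.
    """
    sq = (X ** 2).sum(axis=1)
    D2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * X @ X.T, 0.0)  # squared distances
    np.fill_diagonal(D2, np.inf)               # a sample is not its own neighbor
    n = X.shape[0]
    nbrs = np.argsort(D2, axis=1)[:, :k]       # indices of the k nearest neighbors
    rows = np.repeat(np.arange(n), k)
    cols = nbrs.ravel()
    W = np.zeros((n, n))
    W[rows, cols] = np.exp(-D2[rows, cols] / sigma ** 2)
    return np.maximum(W, W.T)

# Toy data: four points on a line with distinct pairwise distances.
X = np.array([[0.0], [1.0], [3.0], [10.0]])
W = knn_graph(X, k=1, sigma=1.0)
```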

SPG: Cheng et al. [?] proposed a noise-resistant L1-graph which encourages sparse vertex connectedness, motivated
by the work on sparse representation [29]. Subsequent work, entitled sparse probability graph (SPG) [13], enforced
positive graph weights. Following the approach of [32], we implemented a variant of SPG by solving the following
optimization problem for each sample:

$$\min_{w_x} \; \|x - D_x w_x\|_2^2 + \alpha \|w_x\|_1 \quad \text{s.t.} \quad w_x \ge 0, \qquad (3)$$

where $x$ is a feature representation of a sample and $D_x$ is the basis matrix for $x$ constructed from its $n_k$ nearest
neighbors. We use an open-source tool to solve this non-negative Lasso problem.

Table 1: Mean average precision (mAP) (0-1) scores for various graph construction methods. DFC-LRR-10 is performed
for SLR-Graph. The best mAP score for each feature is highlighted in bold. (a) CCV. (b) MED12.
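For readers who want to experiment with the non-negative Lasso of Eq. (3) without a dedicated package, a simple projected gradient iteration suffices on small problems. This is a stand-in sketch, not the open-source tool used in the experiments; the function name `nonneg_lasso`, the squared data-fit term, and the fixed iteration count are assumptions.

```python
import numpy as np

def nonneg_lasso(x, D, alpha, n_iter=2000):
    """Minimize ||x - D w||_2^2 + alpha * ||w||_1 subject to w >= 0.

    On the non-negative orthant the l1 penalty is linear, so a gradient step
    followed by clipping at zero is a valid projected gradient iteration.
    """
    lipschitz = 2.0 * np.linalg.norm(D, 2) ** 2   # Lipschitz constant of the gradient
    w = np.zeros(D.shape[1])
    for _ in range(n_iter):
        grad = 2.0 * D.T @ (D @ w - x) + alpha
        w = np.maximum(w - grad / lipschitz, 0.0)
    return w

# Toy check with an identity basis, where the solution is max(0, x - alpha / 2).
x = np.array([1.0, 0.01, -2.0])
w = nonneg_lasso(x, np.eye(3), alpha=0.05)
```

With an identity basis the problem separates across coordinates, which gives the closed-form solution used in the toy check and makes the sparsifying effect of the penalty easy to see.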

SLR-graph: Our novel graph construction method contains two steps: first, LRR or DFC-LRR is performed on the
entire data set to recover the intrinsic low-rank clustering structure. We then treat the resulting low-rank coefficient
matrix $\hat{Z}$ as an affinity matrix, and for sample $x_i$, the $n_k$ samples with largest affinities to $x_i$ are selected to form a
basis matrix and used to solve the SPG optimization described by Problem (3). The resulting non-negative coefficients
(typically sparse owing to the $\ell_1$ regularization term on $w_x$ in (3)) are used to define the graph.

4.2.3 Experimental Design 

For each benchmarking dataset, we first construct graphs by treating sample images/videos as vertices and using the
three algorithms outlined in Section 4.2.2 to create (sparse) weighted edges between vertices. For fair comparison, we
use the same parameter settings, namely $\alpha = 0.05$ and $n_k = 500$ for both SPG and SLR-graph. Moreover, we set
$k = 40$ for kNN-graph after tuning over the range $k = 10$ through $k = 60$.

We then use a given graph structure to perform semi-supervised label propagation using an efficient label propagation 
algorithm [27] that enjoys a closed-form solution and often achieves the state-of-the-art performance. We performed 
a separate label propagation for each category in our benchmark, i.e., we run a series of 20 binary classification 
label propagation experiments for CCV/MED12 and 81 experiments for NUS-WIDE-Lite. For each category, we 
randomly select half of the samples as training points (and use their ground truth labels for label propagation) and 
use the remaining half of the points as a test set for evaluation. We repeat this process 20 times for each category 
with different random splits. Finally, we compute Mean Average Precision (mAP) based on the results on the test sets 
across all runs of label propagation. 



4.2.4 Experimental Results 

We first performed experiments using the CCV benchmark, the smallest of our datasets, to explore the tradeoff between
computation and accuracy when using DFC-LRR as part of our proposed SLR-graph. Figure 4(a) presents the time
required to run SLR-graph with LRR versus DFC-LRR with three different numbers of subproblems ($t = 5, 10, 15$),
while Figure 4(b) presents the corresponding accuracy results. The figures show that DFC-LRR performs comparably
to LRR for smaller values of $t$, and performance gradually degrades for larger $t$. Moreover, DFC-LRR is up to two
orders of magnitude faster and achieves superlinear speedups relative to LRR.⁵ Given the scalability issues of LRR
on this modest-sized dataset, along with the comparable accuracy of DFC-LRR, we ran SLR-graph exclusively with
DFC-LRR ($t = 10$) for our two larger benchmarking datasets.

⁵ We restricted the maximum number of internal LRR iterations to 500 to ensure that LRR ran to completion in less than two
days.

Table 1 summarizes the results of our semi-supervised learning experiments using the three different graph construction
techniques defined in Section 4.2.2. The results show that our proposed SLR-graph approach leads to significant
performance gains in terms of mAP across all benchmarking datasets for the vast majority of features. These results
demonstrate the benefit of enforcing both low-rankedness and sparsity during graph construction. Moreover, conventional
low-rank oriented algorithms, e.g., [32, 16], would be computationally infeasible on our benchmarking datasets,
thus highlighting the utility of employing DFC's divide-and-conquer approach to generate a scalable algorithm.

5 Conclusion 

Our proposed DFC-LRR algorithm achieves empirical accuracy comparable to LRR while obtaining linear to superlinear
computational gains, both on subspace segmentation tasks and on novel applications for semi-supervised graph-based
classification. Moreover, DFC-LRR preserves the theoretical recovery guarantees of LRR. Our algorithm employs
a divide-and-conquer strategy and extends the principles of the DFC framework to deal with non-decomposable
LRR problems.

DFC-LRR lays the groundwork for developing scalable methods for various LRR-based methods. Indeed, LatLRR 
and NNLRS are leading algorithms for the problems of subspace segmentation and graph semi-supervised learning, 
respectively, and a promising direction for future work involves leveraging DFC-LRR to develop divide-and-conquer 
approaches for these and other LRR-based methods. Moreover, DFC-LRR may also shed light on the development of 
scalable methods for other algorithms for solving convex formulations for subspace segmentation, e.g., [20].
A Proof of Theorem 3

Our proof of Thm. 3 rests upon three key results: a new deterministic recovery guarantee for LRR-type problems
that generalizes the guarantee of [18], a probabilistic estimation guarantee for column projection established in [21],
and a probabilistic guarantee of [21] showing that a uniformly chosen submatrix of a $(\mu, r)$-coherent matrix is nearly
$(\mu, r)$-coherent. These results are presented in Secs. A.1, A.2, and A.3, respectively. The proof of Thm. 3 follows in
Sec. A.4.



In what follows, the unadorned norm $\|\cdot\|$ represents the spectral norm of a matrix. We will also make use of a technical condition, introduced by Liu et al. [18], to ensure that a corrupted data matrix is well-behaved when used as a dictionary:

Definition 4 (Relatively Well-Definedness). A matrix $M = L_0 + S_0$ is $\beta$-RWD if

$$\|\Sigma_M^{-1} V_M^\top V_{L_0}\| \le \frac{1}{\beta \|M\|}.$$

A larger value of $\beta$ corresponds to improved recovery properties.
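
As a rough numerical illustration, the sketch below computes the largest $\beta$ for which a given pair is $\beta$-RWD, reading Definition 4 as $\|\Sigma_M^{-1} V_M^\top V_{L_0}\| \le 1/(\beta\|M\|)$. It assumes NumPy and compact SVDs truncated at a relative tolerance; the names `compact_svd` and `rwd_parameter` and the synthetic data are hypothetical and not part of the original analysis.

```python
import numpy as np

def compact_svd(A, tol=1e-10):
    """Return (U, s, Vt) restricted to singular values above tol * s_max."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    k = int(np.sum(s > tol * s[0]))
    return U[:, :k], s[:k], Vt[:k, :]

def rwd_parameter(M, L0):
    """Largest beta with ||Sigma_M^{-1} V_M^T V_L0|| <= 1 / (beta ||M||)."""
    _, s_M, Vt_M = compact_svd(M)
    _, _, Vt_L0 = compact_svd(L0)
    # Sigma_M^{-1} V_M^T V_{L0}: scale row i of V_M^T V_{L0} by 1 / sigma_i(M).
    core = (Vt_M @ Vt_L0.T) / s_M[:, None]
    return 1.0 / (s_M[0] * np.linalg.norm(core, 2))

rng = np.random.default_rng(0)
L0 = rng.standard_normal((20, 3)) @ rng.standard_normal((3, 30))
L0[:, :4] = 0.0                      # outlier columns carry no low-rank signal
S0 = np.zeros_like(L0)
S0[:, :4] = rng.standard_normal((20, 4))

# Without corruption, beta is the ratio of the smallest to the largest
# nonzero singular value of L0.
beta_clean = rwd_parameter(L0, L0)
beta_noisy = rwd_parameter(L0 + S0, L0)
```

Comparing `beta_clean` with `beta_noisy` gives a feel for how column outliers affect the dictionary's conditioning in this toy setting.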

A.1 Analysis of Low-Rank Representation

Thm. 1 of [18] analyzes LRR recovery under the constraint $O = DZ + S$ when the observation matrix $O$ and the dictionary $D$ are both equal to the input matrix $M$. Our next theorem provides a comparable analysis when the observation matrix is a column submatrix of the dictionary.

Theorem 5. Suppose that $M = L_0 + S_0 \in \mathbb{R}^{m \times n}$ is $\beta$-RWD with rank $r$ and that $L_0$ and $S_0$ have independent column support with $\mathrm{range}(L_0) \cap \mathrm{range}(S_0) = \{0\}$. Let $S_{0,C} \in \mathbb{R}^{m \times l}$ be a column submatrix of $S_0$ supported on $\gamma l$ columns, and suppose that $C$, the corresponding column submatrix of $M$, is $(\frac{\mu}{1-\gamma}, r)$-coherent. Define

$$\gamma^* \triangleq \frac{324\beta^2}{324\beta^2 + 49(11 + 4\beta)^2 \mu r},$$

and let $(\hat{Z}, \hat{S})$ be a solution to the problem

$$\min_{Z,S} \; \|Z\|_* + \lambda \|S\|_{2,1} \quad \text{subject to} \quad C = MZ + S \qquad (4)$$

with $\lambda = 3/(7\|M\|\sqrt{\gamma^* l})$. If $\gamma \le \gamma^*$, then the column space of $\hat{Z}$ equals the row space of $L_0$.

The proof of Thm. 5 can be found in Sec. B.
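
To make the quantities in Thm. 5 concrete, the following minimal sketch evaluates the threshold $\gamma^* = 324\beta^2 / (324\beta^2 + 49(11+4\beta)^2\mu r)$, the weight $\lambda = 3/(7\|M\|\sqrt{\gamma^* l})$, and the objective of Eq. 4 for a candidate pair $(Z, S)$. It assumes NumPy; the function names are hypothetical, and no solver for Eq. 4 is implied.

```python
import math
import numpy as np

def gamma_star(beta, mu, r):
    """Outlier-fraction threshold gamma* from Thm. 5."""
    num = 324.0 * beta ** 2
    return num / (num + 49.0 * (11.0 + 4.0 * beta) ** 2 * mu * r)

def lrr_lambda(M, beta, mu, r, l):
    """Regularization weight lambda = 3 / (7 ||M|| sqrt(gamma* l))."""
    spectral = np.linalg.norm(M, 2)
    return 3.0 / (7.0 * spectral * math.sqrt(gamma_star(beta, mu, r) * l))

def lrr_objective(Z, S, lam):
    """||Z||_* + lam * ||S||_{2,1}, where ||S||_{2,1} sums column l2 norms."""
    nuclear = np.linalg.norm(Z, "nuc")
    l21 = np.sum(np.linalg.norm(S, axis=0))
    return nuclear + lam * l21

# Toy evaluation: Z = I_2 has nuclear norm 2, S has one column of norm 5.
value = lrr_objective(np.eye(2), np.array([[3.0, 0.0], [4.0, 0.0]]), 2.0)
```

Note how quickly $\gamma^*$ shrinks as $\mu r$ grows, which is why the guarantee tolerates only a small fraction of outlier columns in the worst case.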

A.2 Analysis of Column Projection

The following result, due to [21], shows that, with high probability, column projection exactly recovers a $(\mu, r)$-coherent matrix by sampling a number of columns proportional to $\mu r \log n$.

Corollary 6 (Column Projection under Incoherence [21, Cor. 6]). Let $L \in \mathbb{R}^{m \times n}$ be $(\mu, r)$-coherent, and let $L_C \in \mathbb{R}^{m \times l}$ be a matrix of $l$ columns of $L$ sampled uniformly without replacement. If $l \ge c r \mu \log(n) \log(1/\delta)$, where $c$ is a fixed positive constant, then

$$L = L^{proj} \triangleq U_{L_C} U_{L_C}^\top L$$

exactly with probability at least $1 - \delta$.
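
The column projection step can be sketched in a few lines. The example below (assuming NumPy; the function name `column_projection`, the matrix sizes, and the Gaussian test matrix are illustrative choices, not taken from the paper) samples $l$ columns uniformly without replacement and projects $L$ onto their span.

```python
import numpy as np

def column_projection(L, cols, tol=1e-10):
    """Project L onto the column space of its sampled columns L_C."""
    L_C = L[:, cols]
    U, s, _ = np.linalg.svd(L_C, full_matrices=False)
    U = U[:, s > tol * s[0]]        # orthonormal basis U_{L_C}
    return U @ (U.T @ L)            # L^proj = U_{L_C} U_{L_C}^T L

rng = np.random.default_rng(0)
m, n, r, l = 40, 200, 4, 20
L = rng.standard_normal((m, r)) @ rng.standard_normal((r, n))  # rank-r matrix
cols = rng.choice(n, size=l, replace=False)  # uniform, without replacement
L_proj = column_projection(L, cols)
rel_err = np.linalg.norm(L - L_proj) / np.linalg.norm(L)
```

For a low-coherence matrix such as this Gaussian example, `rel_err` should sit at the level of floating-point round-off; a highly coherent matrix (for instance, one whose signal is concentrated in a few columns) can make the same sampling budget fail.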
A.3 Conservation of Incoherence 



The following lemma of [21] shows that, with high probability, $L_{0,i}$ captures the full rank of $L_0$ and has coherence not much larger than $\mu$.

Lemma 7 (Conservation of Incoherence [21, Lem. 7]). Let $L \in \mathbb{R}^{m \times n}$ be $(\mu, r)$-coherent, and let $L_C \in \mathbb{R}^{m \times l}$ be a matrix of $l$ columns of $L$ sampled uniformly without replacement. If $l \ge c r \mu \log(n) \log(1/\delta)/\epsilon^2$, where $c$ is a fixed constant larger than 1, then $L_C$ is $(\frac{\mu}{1 - \epsilon/2}, r)$-coherent with probability at least $1 - \delta/n$.
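
A quick way to build intuition for Lemma 7 is to compare the coherence of a matrix with that of a random column submatrix. The sketch below assumes the common row-space coherence measure $\mu(V) = (n/r)\max_i \|V^\top e_i\|^2$, which may differ in detail from the definition of $(\mu, r)$-coherence used in the main text; the function name and test sizes are hypothetical.

```python
import numpy as np

def row_space_coherence(L, tol=1e-10):
    """Return ((n / r) * max_i ||V^T e_i||^2, r) for the row space of L."""
    _, s, Vt = np.linalg.svd(L, full_matrices=False)
    r = int(np.sum(s > tol * s[0]))
    n = L.shape[1]
    # Column i of Vt[:r] is V^T e_i, so its squared norm is the leverage of column i.
    leverage = np.sum(Vt[:r, :] ** 2, axis=0)
    return (n / r) * float(np.max(leverage)), r

rng = np.random.default_rng(1)
m, n, r, l = 30, 400, 3, 80
L = rng.standard_normal((m, r)) @ rng.standard_normal((r, n))
mu_full, rank_full = row_space_coherence(L)

cols = rng.choice(n, size=l, replace=False)  # uniform, without replacement
mu_sub, rank_sub = row_space_coherence(L[:, cols])
```

Under this definition the coherence always lies between 1 and $n/r$, and the lemma says that the submatrix keeps the rank while its coherence grows by at most a modest constant factor with high probability.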
A.4 Proof of DFC-LRR Guarantee

Recall that, under Alg. 1, the input matrix $M$ has been partitioned into column submatrices $\{C_1, \ldots, C_t\}$. Let $\{C_{0,1}, \ldots, C_{0,t}\}$ and $\{S_{0,1}, \ldots, S_{0,t}\}$ be the corresponding partitions of $L_0$ and $S_0$, let $s_i \triangleq \gamma_i l$ be the size of the column support of $S_{0,i}$ for each index $i$, and let $(\hat{Z}_i, \hat{S}_i)$ be a solution to the $i$th DFC-LRR subproblem.

For each index $i$, we further define $A_i$ as the event that $C_{0,i}$ is $(4\mu/(1 - \gamma_i), r)$-coherent, $B_i$ as the event that $s_i \le \gamma^* l$, and $G(Z)$ as the event that the column space of the matrix $Z$ is equal to the row space of $L_0$. Under our choice of $\gamma^*$, Thm. 5 implies that $G(\hat{Z}_i)$ holds when $A_i$ and $B_i$ are both realized. Hence, when $A_i$ and $B_i$ hold for all indices $i$, the column space of $\hat{Z} = [\hat{Z}_1, \ldots, \hat{Z}_t]$ precisely equals the row space of $L_0$, and the median rank of $\{\hat{Z}_1, \ldots, \hat{Z}_t\}$ equals $r$.
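
The random partition and the per-block outlier counts $s_i = \gamma_i l$ can be simulated directly. The following minimal sketch uses only the Python standard library; `partition_columns` and `outlier_fractions` are hypothetical helper names, and the equal block size $l = n/t$ is an assumption made for simplicity.

```python
import random

def partition_columns(n, t, seed=0):
    """Split column indices 0..n-1 uniformly at random into t equal blocks."""
    if n % t != 0:
        raise ValueError("this sketch assumes t divides n")
    perm = list(range(n))
    random.Random(seed).shuffle(perm)
    l = n // t
    return [perm[i * l:(i + 1) * l] for i in range(t)]

def outlier_fractions(blocks, outlier_cols):
    """Return gamma_i = s_i / l, the fraction of outlier columns in each block."""
    outliers = set(outlier_cols)
    return [sum(j in outliers for j in block) / len(block) for block in blocks]

n, t = 1000, 5
outlier_cols = range(100)            # gamma = 100 / 1000 = 0.1
blocks = partition_columns(n, t)
gammas = outlier_fractions(blocks, outlier_cols)
```

Each $\gamma_i$ fluctuates around the global fraction $\gamma$; the event $B_i$ requires that none of these fluctuations pushes a block above $\gamma^*$.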

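Applying Cor. 6 with

$$l \ge c r \mu \log^2(4n/\delta)/(\gamma^* - \gamma)^2 \ge c r \mu \log(n) \log(4/\delta)$$

shows that, given $A_i$ and $B_i$ for all indices $i$, $\hat{Z}^{proj}$ equals $\hat{Z}$ with probability at least $1 - \delta/4$. To establish $G(\hat{Z}^{proj})$ with probability at least $1 - \delta$, it therefore remains to show that

$$P\Big(\bigcap_{i=1}^t (A_i \cap B_i)\Big) = 1 - P\Big(\bigcup_{i=1}^t (A_i^c \cup B_i^c)\Big) \qquad (5)$$

$$\ge 1 - \sum_{i=1}^t \big(P(A_i^c) + P(B_i^c)\big) \qquad (6)$$

$$\ge 1 - 3\delta/4. \qquad (7)$$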

Because DFC-LRR partitions columns uniformly at random, the variable $s_i$ has a hypergeometric distribution with $\mathbb{E} s_i = \gamma l$ and therefore satisfies Hoeffding's inequality for the hypergeometric distribution [11, Sec. 6]:

$$P(s_i \ge \mathbb{E} s_i + l t) \le \exp(-2 l t^2).$$

It follows that

$$P(B_i^c) = P(s_i > \gamma^* l) = P(s_i > \mathbb{E} s_i + l(\gamma^* - \gamma)) \le \exp(-2 l (\gamma^* - \gamma)^2) \le \delta/(4t)$$

by our assumption that $l \ge c r \mu \log^2(4n/\delta)/(\gamma^* - \gamma)^2 \ge \log(4t/\delta)/[2(\gamma^* - \gamma)^2]$.
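
The tail bound used above can be checked numerically. The sketch below computes the exact hypergeometric upper tail with `math.comb` and compares it to the Hoeffding bound $\exp(-2 l (\gamma^* - \gamma)^2)$. The sizes and the threshold value are hypothetical and chosen only for illustration.

```python
from math import comb, exp

def hypergeom_upper_tail(n, k, l, s_min):
    """P(s >= s_min) when l of n columns are drawn without replacement
    and k of the n columns are outliers."""
    total = comb(n, l)
    hi = min(k, l)
    hits = sum(comb(k, s) * comb(n - k, l - s) for s in range(s_min, hi + 1))
    return hits / total

n, k, l = 1000, 100, 200         # global outlier fraction gamma = k / n = 0.1
gamma = k / n
threshold = 40                   # hypothetical gamma* l, i.e. gamma* = 0.2
gamma_star = threshold / l

# Event B_i^c: the block contains strictly more than gamma* l outliers.
tail = hypergeom_upper_tail(n, k, l, threshold + 1)
bound = exp(-2 * l * (gamma_star - gamma) ** 2)
```

In this configuration the exact tail is far below the bound, which is typical: Hoeffding's inequality ignores the variance reduction that sampling without replacement provides.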

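By Lem. 7 and our choice of

$$l \ge c r \mu \log^2(4n/\delta)/(\gamma^* - \gamma)^2 \ge c r \mu \log(n) \log(4/\delta)/(1 - \gamma),$$

each submatrix $C_{0,i}$ is $(2\mu/(1 - \gamma), r)$-coherent with probability at least $1 - \delta/(4n) \ge 1 - \delta/(4t)$. A second application of Hoeffding's inequality for the hypergeometric further implies that

$$P\Big(\frac{2\mu}{1-\gamma} > \frac{4\mu}{1-\gamma_i}\Big) = P(s_i < \mathbb{E} s_i - l(1 - \gamma)) \le \exp(-2 l (1 - \gamma)^2) \le \delta/(4t),$$

since $l \ge c r \mu \log^2(4n/\delta)/(\gamma^* - \gamma)^2 \ge \log(4t/\delta)/[2(1 - \gamma)^2]$. Hence, $P(A_i^c) \le \delta/(2t)$.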

Combining our results, we find

$$\sum_{i=1}^t \big(P(A_i^c) + P(B_i^c)\big) \le 3\delta/4$$

as desired.

B Proof of Theorem 5

Let $I_0$ be the column support of $S_{0,C}$ and let $I_0^c$ be its set complement in $\{1, \ldots, l\}$. For any matrix $S \in \mathbb{R}^{a \times b}$ and index set $I \subseteq \{1, \ldots, b\}$, we let $\mathcal{P}_I(S)$ be the orthogonal projection of $S$ onto the space of $a \times b$ matrices with column support $I$, so that $(\mathcal{P}_I(S))^{(j)} = S^{(j)}$ if $j \in I$ and $(\mathcal{P}_I(S))^{(j)} = 0$ otherwise.
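
The column-support projection is simple to state in code. This is a minimal sketch assuming NumPy and 0-based column indices (the text uses 1-based indices); the function name `project_columns` is hypothetical.

```python
import numpy as np

def project_columns(S, support):
    """P_I(S): keep the columns indexed by `support`, zero out the rest."""
    out = np.zeros_like(S)
    idx = list(support)
    out[:, idx] = S[:, idx]
    return out

S = np.arange(6, dtype=float).reshape(2, 3)
I0 = {1}                                   # column support (0-based)
I0_c = set(range(S.shape[1])) - I0         # its complement

on_support = project_columns(S, I0)
off_support = project_columns(S, I0_c)
```

The two projections are complementary, so `on_support + off_support` reproduces `S`, and applying either projection twice changes nothing.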
B.1 Oracle Constraints

Our proof of Thm. 5 will parallel Thm. 1 of [18]. We begin by introducing two oracle constraints that would guarantee the desired outcome if satisfied.

Lemma 8. Under the assumptions of Thm. 5, suppose that $C = MZ + S$ for some matrices $(Z, S)$. If $(Z, S)$ additionally satisfy the oracle constraints

$$P_{L_0^\top} Z = Z \quad \text{and} \quad \mathcal{P}_{I_0}(S) = S \qquad (8)$$

then the column space of $Z$ equals the row space of $L_0$.

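Proof. By Eq. 8, the row space of $L_0$ contains the column space of $Z$, so the two will be equal if $\mathrm{rank}(L_0) = \mathrm{rank}(Z)$. This equality indeed holds, since

$$C_0 = \mathcal{P}_{I_0^c}(C) = \mathcal{P}_{I_0^c}(MZ + S) = M \mathcal{P}_{I_0^c}(Z),$$

and therefore $\mathrm{rank}(L_0) = \mathrm{rank}(C_0) \le \mathrm{rank}(M \mathcal{P}_{I_0^c}(Z)) \le \mathrm{rank}(\mathcal{P}_{I_0^c}(Z)) \le \mathrm{rank}(Z) \le \mathrm{rank}(L_0)$. □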

Thus, to prove Thm. 5, it suffices to show that any solution to Eq. 4 also satisfies the oracle constraints of Eq. 8.

B.2 Conditions for Optimality 

To this end, we derive sufficient conditions for solving Eq. 4 and moreover show that if any solution to Eq. 4 satisfies the oracle constraints of Eq. 8, then all solutions do.

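We will require some additional notation. For a matrix $Z \in \mathbb{R}^{n \times l}$ we define $T(Z) \triangleq \{U_Z X + Y V_Z^\top : X \in \mathbb{R}^{r \times l}, Y \in \mathbb{R}^{n \times r}\}$, $\mathcal{P}_{T(Z)}$ as the orthogonal projection onto the set $T(Z)$, and $\mathcal{P}_{T(Z)^\perp}$ as the orthogonal projection onto the orthogonal complement of $T(Z)$. For a matrix $S$ with column support $I$, we define the column normalized version, $B(S)$, which satisfies

$$\mathcal{P}_{I^c}(B(S)) = 0 \quad \text{and} \quad B(S)^{(j)} \triangleq S^{(j)}/\|S^{(j)}\| \quad \forall j \in I.$$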
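
Both operators introduced above have short closed forms. The sketch below (assuming NumPy; function names are hypothetical) uses the standard identity $\mathcal{P}_{T(Z)}(A) = U_Z U_Z^\top A + A V_Z V_Z^\top - U_Z U_Z^\top A V_Z V_Z^\top$ for the projection onto $T(Z)$ and normalizes the nonzero columns of $S$ to obtain $B(S)$.

```python
import numpy as np

def tangent_projection(Z, A, tol=1e-10):
    """P_{T(Z)}(A) = U U^T A + A V V^T - U U^T A V V^T, with U, V from Z's SVD."""
    U, s, Vt = np.linalg.svd(Z, full_matrices=False)
    k = int(np.sum(s > tol * s[0]))
    U, Vt = U[:, :k], Vt[:k, :]
    PU_A = U @ (U.T @ A)
    A_PV = (A @ Vt.T) @ Vt
    return PU_A + A_PV - U @ (U.T @ A_PV)

def column_normalize(S, tol=1e-12):
    """B(S): unit-normalize each nonzero column of S; zero columns stay zero."""
    norms = np.linalg.norm(S, axis=0)
    B = np.zeros_like(S, dtype=float)
    nz = norms > tol
    B[:, nz] = S[:, nz] / norms[nz]
    return B

rng = np.random.default_rng(2)
Z = rng.standard_normal((5, 2)) @ rng.standard_normal((2, 4))  # rank-2 matrix
A = rng.standard_normal((5, 4))
PA = tangent_projection(Z, A)

S = np.array([[3.0, 0.0, 0.0], [4.0, 0.0, 2.0]])
B = column_normalize(S)
```

Two quick sanity checks follow from the definitions: $Z$ itself lies in $T(Z)$, so projecting it returns $Z$, and projecting twice gives the same result as projecting once.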

Theorem 9. Under the assumptions of Thm. 5, suppose that $C = MZ + S$ for some matrices $(Z, S)$. If there exists a matrix $Q$ satisfying

(a) $\mathcal{P}_{T(Z)}(M^\top Q) = U_Z V_Z^\top$

(b) $\|\mathcal{P}_{T(Z)^\perp}(M^\top Q)\| < 1$

(c) $\mathcal{P}_{I_0}(Q) = \lambda B(S)$

(d) $\|\mathcal{P}_{I_0^c}(Q)\|_{2,\infty} < \lambda$

then $(Z, S)$ is a solution to Eq. 4. If, in addition, $\mathcal{P}_{I_0}(Z^+ Z) = 0$ and $(Z, S)$ satisfy the oracle constraints of Eq. 8, then all solutions to Eq. 4 satisfy the oracle constraints of Eq. 8.

Proof. The proof of this theorem is identical to that of [18, Thm. 3], which establishes the same result when the observation $C$ is replaced by $M$. □

It remains to construct a feasible pair $(Z, S)$ satisfying the oracle constraints and $\mathcal{P}_{I_0}(Z^+ Z) = 0$, together with a dual certificate $Q$ satisfying the conditions of Thm. 9.

B.3 Constructing a Dual Certificate 

To this end, we consider the oracle problem:

$$\min_{Z,S} \; \|Z\|_* + \lambda \|S\|_{2,1} \qquad (9)$$

subject to

$$C = MZ + S, \quad P_{L_0^\top} Z = Z, \quad \text{and} \quad \mathcal{P}_{I_0}(S) = S.$$

Let $Y$ be the binary matrix that selects the columns of $C$ from $M$. Then $(P_{L_0^\top} Y, S_{0,C})$ is feasible for this problem, and hence an optimal solution $(Z^*, S^*)$ must exist. By explicitly constructing a dual certificate $Q$, we will show that $(Z^*, S^*)$ also solves the LRR subproblem of Eq. 4.
We will need a variety of lemmas paralleling those developed in [18]. Let $\bar{V} \triangleq V_{Z^*} U_{Z^*}^\top V_{L_0}$. The following lemma was established in [18].

Lemma 10 (Lem. 8 of [18]). $\bar{V} \bar{V}^\top = V_{Z^*} V_{Z^*}^\top$. Moreover, for any $A \in \mathbb{R}^{n \times l}$,

$$\mathcal{P}_{T(Z^*)}(A) = P_{L_0^\top} A + A P_{\bar{V}} - P_{L_0^\top} A P_{\bar{V}}.$$

The next lemma parallels Lem. 9 of [18].

Lemma 11. Let $H = B(S^*)$. Then

$$V_{L_0} \mathcal{P}_{I_0}(\bar{V}^\top) = \lambda P_{L_0^\top} M^\top H.$$

Proof. The proof is identical to that of Lem. 9 of [18]. □

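Define

$$G \triangleq \mathcal{P}_{I_0}(\bar{V}^\top)\big(\mathcal{P}_{I_0}(\bar{V}^\top)\big)^\top \quad \text{and} \quad \psi \triangleq \|G\|.$$

The next lemma parallels Lem. 10 of [18].

Lemma 12. $\psi \le \lambda^2 \|M\|^2 \gamma l$.

Proof. The proof is identical to that of Lem. 10 of [18], save for the size of $I_0$, which is now bounded by $\gamma l$. □

Note that under the assumption $\lambda \le 3/(7\|M\|\sqrt{\gamma l})$, we have $\psi \le 9/49 \le 1/4$.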

The next lemma was established in [18].

Lemma 13 (Lem. 11 of [18]). If $\psi < 1$, then $\mathcal{P}_{I_0}((Z^*)^+ Z^*) = \mathcal{P}_{I_0}(P_{\bar{V}}) = 0$.

Lem. 12 of [18] is unchanged in our setting. The next lemma parallels Lem. 13 of [18].



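Lemma 14. $\|\mathcal{P}_{I_0^c}(\bar{V}^\top)\|_{2,\infty} \le \sqrt{\frac{\mu r}{(1 - \gamma) l}}$.

Proof. By assumption, $C = MZ^* + S^*$, $\mathrm{rank}(C_0) = r$, and $\mathcal{P}_{I_0^c}(C) = C_0 = \mathcal{P}_{I_0^c}(C_0)$. Hence, $C_0 = \mathcal{P}_{I_0^c}(C_0) = M \mathcal{P}_{I_0^c}(Z^*)$, and thus

$$V_{C_0}^\top = \mathcal{P}_{I_0^c}(V_{C_0}^\top) = \Sigma_{C_0}^{-1} U_{C_0}^\top M U_{Z^*} \Sigma_{Z^*} \mathcal{P}_{I_0^c}(V_{Z^*}^\top).$$

This relationship implies that

$$r = \mathrm{rank}(V_{C_0}^\top) \le \mathrm{rank}(\mathcal{P}_{I_0^c}(V_{Z^*}^\top)) \le \mathrm{rank}(V_{Z^*}^\top) = r$$

and therefore that $\mathcal{P}_{I_0^c}(V_{Z^*}^\top)$ is of full row rank. The remainder of the proof is identical to that of Lem. 13 of [18], save for the coherence factor of $(1 - \gamma) l$ in place of $(1 - \gamma) n$. □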

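With these lemmas in hand, we define

$$Q_1 \triangleq \lambda P_{L_0^\top} M^\top H = V_{L_0} \mathcal{P}_{I_0}(\bar{V}^\top)$$

$$Q_2 \triangleq \lambda P_{(L_0^\top)^\perp} \mathcal{P}_{I_0^c}\Big(\big(I + \sum_{i=1}^{\infty} (P_{\bar{V}} \mathcal{P}_{I_0} P_{\bar{V}})^i\big) P_{\bar{V}}\Big) M^\top H P_{\bar{V}}$$

$$= \lambda \mathcal{P}_{I_0^c}\Big(\big(I + \sum_{i=1}^{\infty} (P_{\bar{V}} \mathcal{P}_{I_0} P_{\bar{V}})^i\big) P_{\bar{V}}\Big) P_{(L_0^\top)^\perp} M^\top H P_{\bar{V}},$$

where the first relation follows from Lem. 11. Our final theorem parallels Thm. 4 of [18].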
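Theorem 15. Assume $\psi < 1$, and let

$$Q \triangleq (M^+)^\top (V_{L_0} \bar{V}^\top + \lambda M^\top H - Q_1 - Q_2).$$

If

$$\frac{\gamma}{1 - \gamma} < \frac{\beta^2 (1 - \psi)^2}{(3 - \psi + \beta)^2 \mu r},$$

$$\frac{(1 - \psi)\sqrt{\frac{\mu r}{1 - \gamma}}}{\|M\| \sqrt{l}\,\Big(\beta(1 - \psi) - (1 + \beta)\sqrt{\frac{\gamma}{1 - \gamma} \mu r}\Big)} < \lambda,$$

and

$$\lambda < \frac{1 - \psi}{\|M\| \sqrt{\gamma l}\,(2 - \psi)},$$

then $Q$ satisfies the conditions in Thm. 9.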

Proof. The proof of property S3 requires a small modification. Thm. 4 of [18] establishes that $\mathcal{P}_{I_0}(Q) = \lambda P_M H$. To conclude that $\mathcal{P}_{I_0}(Q) = \lambda H$, we note that $S^* = C - MZ^*$ and that the column space of $M$ contains the column space of $C$, since $C$ is a column submatrix of $M$. Hence, $P_M S^* = S^*$ and therefore $\mathcal{P}_{I_0}(Q) = \lambda P_M H = \lambda H$.

The proofs of properties S4 and S5 are unchanged except for the dimensionality factor, which changes from $n$ to $l$. □



Finally, Lem. 14 of [18] guarantees that the preconditions of Thm. 15 are met under our assumptions on $\lambda$, $\gamma^*$, and $\gamma$.